QBiC


1 Introduction

These experiments investigate how splitting (FASTQ sharding and genomic intervals) affects runtime and storage usage. CPU and memory allocations were kept at the same values wherever possible.

1.1 Goal

We added more options to parallelise along the genome.

1.1.1 Run commands

nextflow run nf-core/sarek -r 3.1.1 -profile cfc --input --outdir -c trace.config -c custom.config --nucleotides_per_second 

custom.config

This runs mapping, duplicate marking, BQSR, QC, and variant calling, using BWA and the non-Spark GATK implementations.

2 Loading the dataset

Load all individual samples and print a sample summary where possible (e.g. a metadata sheet).

name | nucleotides_per_second (number of intervals) | fastp CPUs, --split_fastq | Tower run | work sizes | trace
---- | ---- | ---- | ---- | ---- | ----
fastp4_intervals78 | 10001 (78) | 4, --split_fastq 10000000000 | https://cfgateway1.zdv.uni-tuebingen.de/orgs/QBiC/workspaces/cfc/watch/PE2um0F1SrwRi | yes | yes
fastp8_intervals40 | 70000 (40) | 8, --split_fastq 500000000 | https://cfgateway1.zdv.uni-tuebingen.de/orgs/QBiC/workspaces/cfc/watch/52L0P2tE99JYXs | yes | yes
fastp12 | skipped | 12, --split_fastq 100000000 | https://cfgateway1.zdv.uni-tuebingen.de/orgs/QBiC/workspaces/cfc/watch/ocENvYnNRQAGC | yes | yes
fastp16_intervals1 | 5000000 (1) | 16, --split_fastq 100000000 | https://cfgateway1.zdv.uni-tuebingen.de/orgs/QBiC/workspaces/cfc/watch/2uPwaXSKrcUaKq | yes | yes
fastp0_intervals20 | 200000 (21) | 0, --split_fastq 0 | http://cfgateway1.zdv.uni-tuebingen.de/orgs/QBiC/workspaces/cfc/watch/5iFlE6AEOimMO9 | yes | yes
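
For downstream joins and plot labels, the run sheet above can be held as a small data frame. The following is only a sketch: the column names (`n_intervals`, `fastp_cpus`, `split_fastq`) are illustrative and not taken from the original analysis code, and the `fastp12` run, listed as "skipped" for intervals, is encoded as `NA`.

```r
# Run sheet transcribed from the table above; column names are illustrative.
runs <- data.frame(
  name = c("fastp4_intervals78", "fastp8_intervals40", "fastp12",
           "fastp16_intervals1", "fastp0_intervals20"),
  # NA marks the run whose interval setting is listed as "skipped".
  nucleotides_per_second = c(10001, 70000, NA, 5000000, 200000),
  n_intervals = c(78, 40, NA, 1, 21),
  fastp_cpus = c(4, 8, 12, 16, 0),
  # Stored as numeric because 1e10 exceeds R's integer range.
  split_fastq = c(1e10, 5e8, 1e8, 1e8, 0),
  stringsAsFactors = FALSE
)

# Order runs by the number of fastp CPUs and print a compact summary.
runs <- runs[order(runs$fastp_cpus), ]
print(runs, row.names = FALSE)
```

Keeping the interval count as its own numeric column avoids parsing it back out of the run name later.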

3 Methods

3.1 Mapping

3.1.1 FastP

plot_dataflow_single_process(df_max_time = merged_formatted_fastp$time,
                             df = merged_formatted_fastp$process,
                             df_storage = merged_formatted_fastp$storage,
                             group = "fastp",
                             title = "FastP",
                             xaxis = "# shards",
                             outputname = "fastp",
                             results_folder = results_folder)
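
How `merged_formatted_fastp` is built is not shown in this report. As a rough sketch of the kind of per-setting aggregation such a plot presumably relies on, the example below summarises runtime and storage per shard count. The data frame `trace`, its columns (`shards`, `realtime_min`, `workdir_gb`) and all values are hypothetical and made up for illustration; they are not measurements from these runs.

```r
# Hypothetical per-task records; values are made up for illustration only.
trace <- data.frame(
  shards       = c(1, 1, 4, 4, 4, 4),
  realtime_min = c(60, 62, 18, 17, 19, 16),
  workdir_gb   = c(10, 10, 3, 3, 3, 3)
)

# Per shard setting: the slowest task bounds the wall time of the step,
# the summed task runtimes give the total compute time, and the summed
# work directory sizes approximate storage usage.
summary_by_shards <- merge(
  aggregate(cbind(max_time = realtime_min) ~ shards, data = trace, FUN = max),
  aggregate(cbind(total_time = realtime_min, storage = workdir_gb) ~ shards,
            data = trace, FUN = sum),
  by = "shards"
)
print(summary_by_shards)
```

Base R is used here so the sketch has no package dependencies; the same aggregation in dplyr would be a `group_by()` followed by `summarise(..., .groups = "drop")`.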

3.1.2 BWA-MEM

3.1.3 MarkDuplicates

3.1.4 Summary

3.2 Base quality score recalibration

3.2.1 BaseRecalibrator

3.2.2 GatherBQSRReports

3.2.3 ApplyBQSR

3.2.4 Merge CRAM

3.2.5 Summary

3.3 Variant calling

3.3.1 Variant callers (Germline)

3.3.1.1 DeepVariant

3.3.1.2 Freebayes

3.3.1.3 HaplotypeCaller

3.3.1.4 Strelka

3.3.1.5 Combined plot

3.3.2 Variant callers (Somatic)

3.3.2.1 Mutect2

3.3.2.2 Strelka

3.3.2.3 Freebayes

3.3.2.4 Combined plot

4 Package versions

##   ggpattern       tidyr viridisLite      ggpubr  kableExtra       knitr 
##     "1.0.1"     "1.3.0"     "0.4.1"     "0.5.0"     "1.3.4"      "1.42" 
##     cowplot     ggplot2     forcats   patchwork       dplyr 
##     "1.1.1"     "3.4.0"     "1.0.0"     "1.1.2"     "1.1.0"